Image-based salient object detection (SOD) has been extensively studied over the past decades. However, video-based SOD is much less explored due to the lack of large-scale video datasets in which salient objects are unambiguously defined and annotated. To this end, this paper proposes a video-based SOD dataset that consists of 200 videos (64 minutes). In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in videos can be defined as objects that consistently pop out throughout the video, and such objects can be unambiguously annotated by combining the manually annotated object/region masks with the eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection.

Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at the pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner to automatically infer a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. Experimental results show that the proposed unsupervised approach outperforms 30 state-of-the-art models on the proposed dataset, including 19 classic image-based models (unsupervised or non-deep-learning), 6 deep-learning image-based models, and 5 unsupervised video-based models. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.
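The abstract only outlines the inference pipeline, so the following is a minimal sketch of the stacked-autoencoder idea rather than the authors' implementation. It assumes PyTorch, an illustrative cue dimensionality (CUE_DIM), illustrative layer sizes (LAYER_DIMS) and training settings, none of which are taken from the paper: each pixel's spatiotemporal saliency-cue vector is progressively encoded, layer by layer, into a scalar saliency score using only unsupervised reconstruction objectives.

```python
# Minimal sketch (not the authors' code) of saliency inference with
# stacked autoencoders. Assumption: each pixel is described by a
# CUE_DIM-dimensional vector of spatiotemporal saliency cues gathered
# from the pixel and its neighbors; layer sizes and training settings
# below are illustrative, not taken from the paper.
import torch
import torch.nn as nn

CUE_DIM = 64                       # assumed cue-vector dimensionality
LAYER_DIMS = [CUE_DIM, 32, 16, 1]  # progressively encode cues to a scalar score


class AELayer(nn.Module):
    """One autoencoder layer: an encoder plus a decoder used for pretraining."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))


def pretrain_stack(cues, epochs=50, lr=1e-3):
    """Greedy layer-wise unsupervised pretraining on saliency-cue vectors.

    cues: (N, CUE_DIM) tensor of per-pixel saliency cues scaled to [0, 1].
    Returns the trained layers; stacking their encoders maps a cue vector
    to a scalar saliency score without using any human labels.
    """
    layers, h = [], cues
    for d_in, d_out in zip(LAYER_DIMS[:-1], LAYER_DIMS[1:]):
        layer = AELayer(d_in, d_out)
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(layer(h), h)  # reconstruction loss
            loss.backward()
            opt.step()
        h = layer.encoder(h).detach()  # encoded output feeds the next layer
        layers.append(layer)
    return layers


def saliency_scores(layers, cues):
    """Pass cue vectors through the stacked encoders to get per-pixel scores."""
    with torch.no_grad():
        h = cues
        for layer in layers:
            h = layer.encoder(h)
    return h.squeeze(-1)  # (N,) saliency score in [0, 1] per pixel


if __name__ == "__main__":
    fake_cues = torch.rand(1000, CUE_DIM)  # stand-in for real cue maps
    stack = pretrain_stack(fake_cues, epochs=5)
    print(saliency_scores(stack, fake_cues).shape)  # torch.Size([1000])
```

The greedy layer-wise scheme shown here is the classic way to build a stacked autoencoder without labels; whether the paper uses exactly this training procedure, and which spatiotemporal cues populate the input vector, is not specified in the abstract.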